ABSTRACT

The availability of an overwhelmingly large amount of bibliographic
information, including citation and co-authorship data, makes it
imperative to have a systematic approach that enables an author to
organize her own personal academic network profitably. An effective
method could be to arrange one’s co-authorship network into a set of
“circles”, which has been a recent practice for organizing
relationships (e.g., friendship) in many online social networks.

In this paper, we propose an unsupervised approach to automatically
detect circles in an ego network such that each circle represents a
densely knit community of researchers. Our model combines a variety
of node features and node similarity measures, and is built from rich
co-authorship network data covering more than 800,000 authors. In the
first level of evaluation, our model achieves a 13.33% improvement in
terms of overlapping modularity compared to the best among four
state-of-the-art community detection methods. Further, we conduct a
task-based evaluation: two basic frameworks for collaboration
prediction are considered with the circle information (obtained from
our model) included in the feature set. Experimental results show
that including the circle information detected by our model improves
the prediction performance by 9.87% and 15.25% on average in terms of
AUC (area under the ROC curve) and Prec@20 (precision at top 20),
respectively, compared to the case where the circle information is
not present.

1. INTRODUCTION 

Nowadays, public repositories of bibliographic datasets such as DBLP
and Google Scholar give us access to a stream of scientific articles
published by authors from different domains. An author we wish to
analyze might be associated with overwhelming volumes of information
in terms of her collaborations and publications, which in turn leads
to both information overload and high computational complexity.
Moreover, from an author’s perspective, it could become painstakingly
difficult to keep track of the entire set of academic relationships
she has with her collaborators at any point in time.

Present Work: Problem definition. In this article, we study the
problem of automatically discovering an author’s academic circles. In
particular, given a single author with her co-authorship network, our
goal is to identify her circles, each of which is a subset of her
coauthors. Some examples of real-world circles in an author’s
co-authorship network are shown in Figure 1. The “owner” of such a
network (the “ego”) may wish to form circles based on common bonds
and attributes among her coauthors (the “alters”). An author could
have several reasons for initiating a new collaboration. Common
tendencies include collaborating with people from her own institute
or with people sharing her research interests. Therefore, deciding
upon a single dimension that both characterizes the circles and
categorizes the coauthors appropriately becomes extremely
challenging. Moreover, circles are author-specific, as each author
organizes her personal network of coauthors independently of all
other authors with whom she is not connected. This leads to the
problem of designing an automatic method that organizes an author’s
academic network or, more precisely, categorizes her surrounding
neighborhood into meaningful circles.

Figure 1: (Color online) A hypothetical example showing an ego
network of an author u with labeled circles. Alters might belong to
multiple groups and form overlapping circles.

Present Work: Motivation of the work. Detecting ego-centric circles
in a co-authorship network is useful in many respects. The
collaborators of a particular researcher might have interests aligned
with different topics, and the set of collaborators the researcher is
currently working with reflects her current topic of interest. Thus,
by understanding the circles in her co-authorship network, she might
discover an interest in reading papers on a topic that she has not
been interested in before. This in turn helps in personalized paper
recommendation. On the other hand, if one wishes to start a new
collaboration in a particular field with a very famous researcher
(who usually has less opportunity for new collaborations), a more
successful approach could be to first establish a collaboration with
one of the coauthors of the famous researcher who belongs to the
circle most aligned with the field of interest. Therefore, the circle
information could lead to the design of a meaningful collaboration
prediction system. Moreover, one can also discover the collaboration
pattern of a researcher by observing the temporal evolution of her
ego-centric circles.

Present Work: An unsupervised approach for circle detection. In this
work, we propose an unsupervised method to learn the major dimensions
of author profile similarity that lead to densely linked circles. In
practice, since the topological evidence in such small ego networks
is limited, traditional community finding algorithms fail to discover
meaningful circles from them [?]. Here, we intend the following two
conditions to be satisfied during circle detection. First, we expect
the circles to be formed by densely connected sets of alters;
different circles might overlap, i.e., alters might belong to
multiple circles simultaneously. Second, we expect the members of the
same circle to share common properties or traits [18]. We model the
similarity between alters as a function of common profile
information, and the method learns precisely which dimensions of
profile similarity lead to densely linked circles. In each iteration,
our model tries to learn the connectivity between alters from the
actual graph and updates the circle memberships accordingly. Once the
optimal condition is reached, the model outputs the circles. We make
our experimental code available in the spirit of reproducible
research: http://cnerg.org/circle

Summary of the evaluation. The entire experiment is conducted on a
massive dataset from the computer science domain comprising more than
800,000 authors. Some interesting observations from the extensive
analysis of the detected ego-centric circles are as follows: (i)
highly-cited authors tend to form a larger number of large and highly
cohesive circles; (ii) highly-cited authors seem to coauthor with a
group of people having a specific research interest in a particular
time period and then leave this group to form another such group of
coauthors; (iii) highly-cited authors tend to spawn circles whose
alters work in very similar fields, whereas medium-cited authors
spawn more diverse circles. To evaluate the quality of the detected
circles, we compare our model with four state-of-the-art overlapping
community detection algorithms in terms of the standard overlapping
modularity measure and achieve an improvement of 13.33% over the best
baseline method. Further, we conduct a task-based evaluation where we
show that including the circle information detected by our model in
the feature set improves the performance of existing collaboration
prediction models (linear regression and supervised random walks) by
9.87% and 15.25% in terms of AUC (area under the ROC curve) and
Prec@20, respectively, compared to the case where the circle
information is not present. With respect to the best baseline that
provides circle information, our model achieves average improvements
of 3.35% and 6.26% in terms of AUC and Prec@20, respectively.

2. RELATED WORK 

We broadly divide the related work into two parts: research on the
ego structure of co-authorship networks and research on discovering
local circles in an ego network.

Research exploring ego structure in co-authorship networks. One of
the most interesting yet curiously understudied aspects is the
analysis of the structural properties of ego-alter interactions in
co-authorship networks. Eaton et al. [8] found that the productivity
of an author is associated with degree centrality, confirming that
scientific publishing is related to the extent of collaboration.
Borner et al. [6] presented several network measures that investigate
the changing impact of author-centric networks. Yan and Ding [?]
analyzed the Library and Information Science co-authorship network in
relation to the impact of its researchers, finding important
correlations. Abbasi et al. extensively studied the relationship
between scientific impact and co-authorship patterns, discovering
significant correlations between network indicators (density and
ego-betweenness) and performance indicators such as the g-index [1]
and citation counts [2]. McCarty et al. [17] attempted to predict the
evolution of the h-index through ego networks, observing that it
increases if one chooses to coauthor articles with authors who
already have a high h-index.

Research on discovering local circles in ego networks. McAuley and
Leskovec [15, 16] were the first to explore social circles in ego
networks. They cast this problem as a multi-membership node
clustering problem and developed a model for detecting circles that
combines network structure with user profile information from
Google+, Facebook and Twitter. They remarked that these local circles
cannot be discovered using traditional community detection algorithms
[?] because of the dearth of information on topological structure in
the ego network of each author [9] [16]. According to them, under
such circumstances topic-modeling techniques [3] [5] are best suited
to uncover “mixed memberships” of nodes in multiple groups. To the
best of our knowledge, ours is the first attempt to detect local
circles (groups of coauthors with similar features) centered around
each ego/author in a co-authorship network and to use this
information to enhance the performance of existing collaboration
prediction models.

3. AN UNSUPERVISED MODEL FOR DISCOVERING
EGO-CENTRIC CIRCLES

Our model for detecting ego-centric circles applies to any general
ego network, where a node is considered as the ego and the set of her
one-hop neighbors constitutes the set of alters. The ego is said to
spawn the ego network, but is not considered a part of it. Our method
discovers circles in this ego network in an unsupervised fashion,
leveraging properties specific to nodes as well as properties of the
network. Our model requires each node to have a profile, which is
essentially a feature vector characterizing the node in a feature
space. Two nodes are said to be similar if their feature vectors are
similar, as evaluated by an appropriate similarity metric. Although
the exact profile details and the similarity metric will vary
depending on the nature of the network, some general assumptions made
by our model are as follows:

• Alters of the same ego that have similar profiles should be in the
same circles, while those with dissimilar profiles should be in
different circles.

• Alters that share an edge are more likely to be part of the same
circle than disconnected alters.

• While it should be possible to label each circle by some common
property of its member nodes, a circle may actually have more than
one label. In our earlier example of a co-authorship network, two or
more circles may contain authors from the same field, but may differ
in some other attribute, such as the institute the authors belong to,
as shown in Figure 1.

• Circles may overlap and may even contain other smaller circles.

We now describe the algorithm for circle formation in more detail.
The input to our algorithm is an ego network $G = \langle V, E
\rangle$. Each node $v \in V$ has an $N$-dimensional profile vector
$\{f_{1v}, f_{2v}, f_{3v}, \ldots, f_{Nv}\}$, where $f_{iv}$ denotes
the value of the $i$-th feature of the node $v$. The ego node $u$,
often referred to as the center node, is responsible for spawning the
ego network, but does not itself feature as a part of the network. So
the ego network of $u$ is essentially the subgraph induced by the
alters of $u$. Let $D(x, y)$ be the Euclidean distance between the
profile vectors of nodes $x$ and $y$, given by Equation 1.


$$D(x, y) = D(y, x) = \sqrt{\sum_{i=1}^{N} (f_{ix} - f_{iy})^2} \tag{1}$$

The aim of the method is to identify a set of circles $C = \{C_1,
C_2, \ldots, C_K\}$. Given a circle $C_j \in C$ and a node $y \in V$,
we define the distance of $y$ from $C_j$, say $D'(C_j, y)$, as the
average distance of $y$ from all other nodes in $C_j$. Also, the
profile similarity measure between a pair of nodes $x$ and $y$,
denoted by $Sim(x, y)$, is defined to be the reciprocal of $D(x, y)$.
Analogously, the similarity between node $y$ and circle $C_j$,
denoted by $Sim'(C_j, y)$, is defined to be the reciprocal of
$D'(C_j, y)$. We shall demonstrate the merit of this profile
similarity measure in Sections 4 and 6.

Each circle $C_j$ in our model has a similarity threshold parameter
$\tau_j$ associated with it such that if node $y \in V$ is in $C_j$
then the following constraint is satisfied:

$$Sim'(C_j, y) \geq \tau_j \tag{2}$$

Based on our assumption that nodes within a common circle at any
point of time have a higher probability of forming an edge in the
network, our model predicts the circles estimated at each step to be
cliques, and distinct circles not to share any edge at all. Given a
set of $K$ circles $C = \{C_1, C_2, \ldots, C_K\}$, along with a set
of threshold parameters $\tau = \{\tau_1, \tau_2, \ldots, \tau_K\}$
in any iteration of the algorithm, we define a closeness estimator
for a pair of nodes $(x, y) \in V \times V$ in terms of their circle
membership, denoted by $\beta(x, y)$. Let $\beta_1(x, y)$ and
$\beta_2(x, y)$ be defined as follows.

$$\beta_1(x, y) = \sum_{C_j : \{x, y\} \subseteq C_j} (Sim(x, y) - \tau_j + \lambda)^{-1} \tag{3}$$

$$\beta_2(x, y) = \sum_{C_j : \{x, y\} \not\subseteq C_j} (Sim(x, y) - \tau_j + \lambda)^{-1} \tag{4}$$

Note that $\{x, y\} \subseteq C_j$ if both $x$ and $y$ are members of
the circle $C_j$, while $\{x, y\} \not\subseteq C_j$ if $C_j$ does
not contain one or both of $x$ and $y$. The constant $\lambda$ is
kept large enough to ensure that no term in the summation is negative
and may simply be taken as the maximum of all threshold values, i.e.,
$\max\{\tau_1, \tau_2, \ldots, \tau_K\}$. Note that $\beta_1(x, y)$
is high if $x$ and $y$ share common circles with very high
thresholds, while $\beta_2(x, y)$ is high if $x$ and $y$ do not share
common circles with high thresholds.

Now, we define the closeness estimator $\beta(x, y)$ as follows.

$$\beta(x, y) = \exp\{[\beta_1(x, y)]^2 - [\beta_2(x, y)]^2\} \tag{5}$$

Note that $\beta(x, y)$ is purely a circle-membership based
similarity metric for the pair $(x, y)$, and increases with the
number and threshold values of the common circles which $x$ and $y$
are part of. Thus, the closeness estimator emphasizes not only the
common circle memberships of nodes but also the thresholds of the
circles they are part of.

From the closeness information so estimated, the probability that the
pair $(x, y)$ forms an edge in $G$ is modeled by:

$$p((x, y) \in E) = \frac{\beta(x, y)}{1 + \beta(x, y)} \tag{6}$$

Similarly, for a node pair $(x, y)$ which does not belong to $E$, the
probability is estimated as follows:

$$p((x, y) \notin E) = 1 - p((x, y) \in E) = \frac{1}{1 + \beta(x, y)} \tag{7}$$

Quite evidently, the edge probability $p(x, y)$ increases with
$\beta(x, y)$ and is normalized using add-one smoothing [?]. Thus we
get a predicted probability of existence for each possible edge in
the network given $C$ and $\tau$. The rationale underlying the
prediction is that the closeness of a pair of nodes $(x, y)$ is
proportional to the similarity of their profiles as well as the
number and similarity thresholds of the common circles that they are
a part of. Now the model must ensure that this predicted network
indeed corresponds to the real network, for which we present the
following analysis.
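
For illustration, the closeness estimator and the resulting edge probability can be sketched in Python as follows. This is a minimal sketch, not the released implementation: it assumes that each term of β1 and β2 is the reciprocal of Sim(x, y) − τ_j + λ, that β = exp(β1² − β2²), and that the edge probability is β / (1 + β). The function names and the toy profiles, circles and thresholds are hypothetical.

```python
import math

def similarity(fx, fy):
    """Sim(x, y): reciprocal of the Euclidean distance between two profile vectors."""
    d = math.dist(fx, fy)
    return math.inf if d == 0 else 1.0 / d

def phi(x, y, s_xy, circles, taus):
    """log beta(x, y) = beta1^2 - beta2^2 under the assumed forms stated above."""
    lam = max(taus)  # lambda = largest threshold, so every denominator is positive
    beta1 = beta2 = 0.0
    for members, tau in zip(circles, taus):
        term = 1.0 / (s_xy - tau + lam)
        if x in members and y in members:
            beta1 += term  # circle contains both x and y
        else:
            beta2 += term  # circle misses x, y, or both
    return beta1 ** 2 - beta2 ** 2

def edge_probability(phi_xy):
    """beta / (1 + beta) with beta = exp(phi), written to avoid overflow."""
    if phi_xy >= 0:
        return 1.0 / (1.0 + math.exp(-phi_xy))
    e = math.exp(phi_xy)
    return e / (1.0 + e)

# Toy example (hypothetical profiles, circles and thresholds).
profiles = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 5.0]}
circles = [{"a", "b"}, {"c"}]
taus = [1.0, 0.5]
p_ab = edge_probability(
    phi("a", "b", similarity(profiles["a"], profiles["b"]), circles, taus))
p_ac = edge_probability(
    phi("a", "c", similarity(profiles["a"], profiles["c"]), circles, taus))
```

In this toy setting the pair (a, b), which shares a high-threshold circle, receives a probability above 0.5, while the dissimilar pair (a, c), which shares no circle, receives a probability close to zero.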

Assuming independent generation of each edge in the graph, the joint
probability of $G$ and $C$ can be written as

$$P_\tau(G; C) = \prod_{(x, y) \in E} p((x, y) \in E) \prod_{(x, y) \notin E} p((x, y) \notin E) \tag{8}$$

We define the following notation (Equation 9) for ease of expression:

$$\phi(x, y) = \log(\beta(x, y)) = [\beta_1(x, y)]^2 - [\beta_2(x, y)]^2 \tag{9}$$

Taking the logarithm of Equation 8 and using the notation of Equation
9, we can express the log likelihood of $G$ given $C$ and $\tau$ as:

$$\begin{aligned}
l_\tau(G; C) &= \log(P_\tau(G; C)) \\
&= \sum_{(x, y) \in E} \log(p((x, y) \in E)) + \sum_{(x, y) \notin E} \log(p((x, y) \notin E)) \\
&= \sum_{(x, y) \in E} \log(\beta(x, y)) - \sum_{(x, y) \in V \times V} \log(1 + \beta(x, y)) \\
&= \sum_{(x, y) \in E} \phi(x, y) - \sum_{(x, y) \in V \times V} \log(1 + \exp\{\phi(x, y)\})
\end{aligned} \tag{10}$$

The model thus attempts to identify a set of circles $C$ that
maximizes $l_\tau(G; C)$. In Section 4 we describe how this may be
achieved by optimizing $\tau$. Also, in Section 6 we describe how
this generic model can be applied to co-authorship networks in
particular.
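
The log likelihood of Equation 10 can be evaluated directly once φ(x, y) is available for every pair. The sketch below is illustrative only: `phi` is a hypothetical callable, pairs are enumerated as unordered pairs of distinct nodes (an assumption about how V × V is traversed), and log(1 + exp(φ)) is computed in a numerically stable way.

```python
import math
from itertools import combinations

def softplus(z):
    """log(1 + exp(z)) without overflow for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_likelihood(nodes, edges, phi):
    """Sum of phi over edges minus sum of log(1 + exp(phi)) over all pairs."""
    edge_set = {frozenset(e) for e in edges}
    total = 0.0
    for x, y in combinations(nodes, 2):
        z = phi(x, y)
        if frozenset((x, y)) in edge_set:
            total += z  # first sum: only pairs that are real edges
        total -= softplus(z)  # second sum: every pair
    return total

# Toy check: a phi that agrees with the graph scores higher than one that does not.
nodes = ["a", "b", "c"]
edges = [("a", "b")]
agrees = lambda x, y: 5.0 if {x, y} == {"a", "b"} else -5.0
disagrees = lambda x, y: -agrees(x, y)
ll_good = log_likelihood(nodes, edges, agrees)
ll_bad = log_likelihood(nodes, edges, disagrees)
```

A circle assignment whose φ values are large on real edges and small elsewhere drives the log likelihood towards zero from below, which is the behavior the maximization relies on.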

4. UNSUPERVISED LEARNING OF MODEL 
PARAMETERS 

In this section, we describe the method used to find the set of
circles $C$ by maximizing the log likelihood in Equation 10.
Algorithm 1 summarizes the steps of a single iteration of the
algorithm.

Initially, each node is in a different circle with a very high
threshold value. At each iteration $t$, for each node $y \in V$ we
alter the circle membership of $y$ by randomly adding it to some
circles it previously did not belong to and deleting it from some
circles it belonged to. This is similar to the concept of perturbing
the solution in simulated annealing [14]. The circle thresholds are
then updated accordingly such that the constraint in Equation 2 is
not violated.

The general idea is that the larger the number of circles a node $y$
is already part of after time step $t$, the smaller the extent to
which the circle membership of $y$ is disturbed in time step $t + 1$.

We denote by $C_t$ the set of circles and by $\tau_t$ the
corresponding set of thresholds after time step $t$, where $C_t =
\{C_1(t), C_2(t), \ldots, C_K(t)\}$ and $\tau_t = \{\tau_1(t),
\tau_2(t), \ldots, \tau_K(t)\}$. Also, let the log likelihood of $G$
given $C_t$ and $\tau_t$ be $l_{\tau_t}(G; C_t)$. The following are
the main steps of the algorithm to update the circles in time step
$t + 1$:

Step 1: For each node $y \in V$, we capture the circle membership of
$y$ at time $t$ by defining two sets $S1_{y,t}$ and $S2_{y,t}$:

$$S1_{y,t} = \{C_j(t) \mid C_j(t) \in C_t \wedge y \in C_j(t)\} \tag{11}$$

$$S2_{y,t} = \{C_j(t) \mid C_j(t) \in C_t \wedge y \notin C_j(t)\} \tag{12}$$


Step 2: Now we compute the number of circles to add $y$ to and to
remove $y$ from, given by the two variables $AddCircle(y, t + 1)$ and
$RemoveCircle(y, t + 1)$:

$$AddCircle(y, t + 1) = \left\lceil \frac{K1 + |S1_{y,t}|}{|S1_{y,t}|} \right\rceil \tag{13}$$

$$RemoveCircle(y, t + 1) = \left\lceil \frac{K2 + |S1_{y,t}|}{|S1_{y,t}|} \right\rceil \tag{14}$$

Here, $K1$ is a randomly chosen integer with $1 \leq K1 \leq
|S2_{y,t}|$, such that the value of $AddCircle(y, t + 1)$ is less
than or equal to $|S2_{y,t}|$, i.e., the number of circles that $y$
is currently not part of. Similarly, $K2$ is a randomly chosen
integer with $1 \leq K2 \leq |S1_{y,t}|$ such that the value of
$RemoveCircle(y, t + 1)$ is less than or equal to $|S1_{y,t}|$, i.e.,
the number of circles that $y$ is currently part of. Note that both
$AddCircle(y, t + 1)$ and $RemoveCircle(y, t + 1)$ are low for high
values of $|S1_{y,t}|$. This ensures that the more circles $y$ is
currently part of, the smaller the disturbance to the circle
membership of $y$ (and vice versa).
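
The counts of Step 2 can be sketched as below. Several details are assumptions of this sketch rather than statements of the method: the quotient is rounded up, the bounds on the counts are enforced by clamping instead of by re-drawing K1 or K2, and a node that currently belongs to no circle (a case the formulas leave undefined) is simply re-added to one circle. The function name is hypothetical.

```python
import math
import random

def perturbation_counts(n_in, n_out, rng=random):
    """Return (add, remove) for a node inside n_in circles and outside n_out circles."""
    if n_in == 0:
        # Assumption: the formulas divide by |S1|; an orphan node is re-added once.
        return (1 if n_out > 0 else 0), 0
    add = 0
    if n_out > 0:
        k1 = rng.randint(1, n_out)  # 1 <= K1 <= |S2|
        add = min(math.ceil((k1 + n_in) / n_in), n_out)
    k2 = rng.randint(1, n_in)  # 1 <= K2 <= |S1|
    remove = min(math.ceil((k2 + n_in) / n_in), n_in)
    return add, remove
```

Because the random integer is divided by the number of circles the node already belongs to, a node in many circles is moved in and out of only a couple of circles per iteration, whereas a node in a single circle can be added to many.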

Step 3: Add $y$ to $AddCircle(y, t + 1)$ randomly chosen circles from
$S2_{y,t}$ and remove $y$ from $RemoveCircle(y, t + 1)$ randomly
chosen circles from $S1_{y,t}$. The corresponding circles are updated
accordingly.

Step 4: Once Steps 1, 2 and 3 are over for each node, we have the set
$C_{t+1}$ containing the updated circles. Next, we update the
corresponding thresholds by setting $\tau_j(t + 1)$, corresponding to
the circle $C_j(t + 1)$, to the largest value for which the
constraint in Equation 2 holds for every node $y \in C_j(t + 1)$,
i.e., to the smallest similarity between the circle and any of its
members. Thus the updated $\tau_j(t + 1)$ for $C_j(t + 1)$ is given
by:

$$\tau_j(t + 1) = \min\{Sim'(C_j(t + 1), y) \mid y \in C_j(t + 1)\} \tag{15}$$

Step 5: If the threshold $\tau_j(t + 1)$ for $C_j(t + 1)$ falls below
a constant lower limit $\tau_L$, we discard $C_j(t + 1)$. The value
of $\tau_L$ is empirically determined. In our experiments, we tested
a wide range of values of $\tau_L$ and set it to 0.2 for best results
(see Figure 3).
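
Steps 4 and 5 can be sketched together as follows. The helper names and the toy profiles are hypothetical, and the treatment of singleton circles (infinite similarity, so they are never discarded) is an assumption of this sketch.

```python
import math

def circle_similarity(y, circle, profiles):
    """Sim'(C_j, y): reciprocal of the average distance from y to the other members."""
    others = [z for z in circle if z != y]
    if not others:
        return math.inf  # assumption: a singleton circle imposes no constraint
    avg = sum(math.dist(profiles[y], profiles[z]) for z in others) / len(others)
    return math.inf if avg == 0 else 1.0 / avg

def updated_threshold(circle, profiles):
    """Equation 15: the smallest Sim'(C_j, y) over the members y of the circle."""
    return min(circle_similarity(y, circle, profiles) for y in circle)

def prune(circles, profiles, tau_low=0.2):
    """Step 5: keep (circle, threshold) pairs whose threshold is at least tau_low."""
    kept = []
    for circle in circles:
        tau = updated_threshold(circle, profiles)
        if tau >= tau_low:
            kept.append((circle, tau))
    return kept

# Toy example: {a, b} is tight, {a, c} is too loose and gets discarded.
profiles = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 10.0)}
kept = prune([{"a", "b"}, {"a", "c"}], profiles)
```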

Step 6: We then compute the log likelihood $l_{\tau_{t+1}}(G;
C_{t+1})$ using Equation 10. If $l_{\tau_{t+1}}(G; C_{t+1}) >
l_{\tau_t}(G; C_t)$, then we retain the newly computed sets $C_{t+1}$
and $\tau_{t+1}$; otherwise we set $C_{t+1} = C_t$ and $\tau_{t+1} =
\tau_t$.

The process continues until we reach a maximum, i.e., the log
likelihood does not increase any further for sufficiently many
iterations. We then report the set of circles so obtained as the
optimal set of circles. Note that the maximum number of circles after
any iteration of the algorithm is $|E|$ and the maximum number of
nodes in any circle is also $|V|$. So the running time of each
iteration of the algorithm is $O(|E| + |C_t|) = O(|E|)$. Also, any
change to the set of circles is accepted only if the overall
likelihood increases, and so the method converges to a local maximum
after a finite number of steps. For practical applications, the
method is assumed to have reached a local maximum if the likelihood
function does not increase for $|V|$ iterations.
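
The accept-or-revert rule of Step 6, together with the stopping criterion, amounts to a stochastic hill climb. A generic sketch follows, in which the hypothetical callables `propose` and `score` stand in for one circle-update iteration and for the log likelihood of Equation 10, and `patience` plays the role of the $|V|$ non-improving iterations.

```python
def hill_climb(initial, propose, score, patience):
    """Keep a candidate only if it raises the score; stop after `patience` failures in a row."""
    best, best_score = initial, score(initial)
    stale = 0
    while stale < patience:
        candidate = propose(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score, stale = candidate, candidate_score, 0
        else:
            stale += 1  # revert: keep the previous state unchanged
    return best, best_score

# Toy usage with a deterministic proposal: climbs 0 -> 1 -> 2 -> 3 and then stalls.
result = hill_climb(0, lambda s: s + 1, lambda s: -(s - 3) ** 2, patience=5)
```

Since a state is replaced only by a strictly better one, the score sequence is monotone, which is why the procedure can only end in a local maximum.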

5. A LARGE PUBLICATION DATASET 

We crawled one of the largest publicly available datasets from
Microsoft Academic Search (MAS), which houses over 4.1 million
publications and 2.7 million authors. We collected all the papers
published in the computer science domain and indexed by MAS. The
crawled dataset contains more than 2 million distinct papers by more
than 800,000 authors, distributed over 24 fields of computer science.
The co-authorship network constructed from this dataset has authors
as nodes and edges between authors who have written at least one
paper together.
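
As a minimal sketch of this construction, assuming each paper is given simply as a list of author identifiers (a hypothetical input format), the co-authorship network and the ego network of Section 3 (the subgraph induced by the alters, with the ego itself excluded) can be built as follows.

```python
from itertools import combinations

def coauthorship_graph(papers):
    """Map each author to the set of authors sharing at least one paper with her."""
    graph = {}
    for authors in papers:
        unique = set(authors)
        for a in unique:
            graph.setdefault(a, set())  # single-author papers still create a node
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def ego_network(graph, ego):
    """Subgraph induced by the alters of `ego`; the ego and its edges are excluded."""
    alters = set(graph[ego])
    return {a: graph[a] & alters for a in alters}

# Toy example with hypothetical author identifiers.
papers = [["u", "a", "b"], ["u", "c"], ["a", "d"]]
graph = coauthorship_graph(papers)
ego_u = ego_network(graph, "u")
```

Here the ego network of `u` contains the alters `a`, `b` and `c`; `d` is left out because it never coauthored with `u`, and `c` remains as an isolated node.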

Ego network: The next step is the construction of ego 
networks from the co-authorship network. We consider the 
ego networks corresponding to each node (author) present 
in our dataset, thus obtaining 821,633 ego networks. An 
illustrative example of an ego network is shown in Figure 1. However,
in this experiment we consider only the induced subgraph of the
alters for an ego and exclude the ego and its attached edges from the
ego network, as mentioned earlier.

Algorithm 1 Iteration for Updating Circles

procedure CircleUpdate(t, C_t, τ_t, l_{τ_t}(G; C_t))
    C_{t+1} ← C_t
    τ_{t+1} ← τ_t
    for all y ∈ V do
        S1_{y,t} = {C_j(t) | C_j(t) ∈ C_t ∧ y ∈ C_j(t)}
        S2_{y,t} = {C_j(t) | C_j(t) ∈ C_t ∧ y ∉ C_j(t)}
        K1 ← random(1, |S2_{y,t}|)
        K2 ← random(1, |S1_{y,t}|)
        AddCircle(y, t+1) ← ⌈(K1 + |S1_{y,t}|) / |S1_{y,t}|⌉
        RemoveCircle(y, t+1) ← ⌈(K2 + |S1_{y,t}|) / |S1_{y,t}|⌉
        Randomly choose C_ac, C_rc:
        C_ac ⊆ S2_{y,t}, |C_ac| = AddCircle(y, t+1)
        for all C_j(t) ∈ C_ac do
            C_j(t+1) ← C_j(t+1) ∪ {y}
        C_rc ⊆ S1_{y,t}, |C_rc| = RemoveCircle(y, t+1)
        for all C_j(t) ∈ C_rc do
            C_j(t+1) ← C_j(t+1) \ {y}
    for all C_j(t+1) ∈ C_{t+1} do
        τ_j(t+1) ← min{Sim'(C_j(t+1), y) | y ∈ C_j(t+1)}
        if τ_j(t+1) < τ_L then
            C_{t+1} ← C_{t+1} \ {C_j(t+1)}
            τ_{t+1} ← τ_{t+1} \ {τ_j(t+1)}
    Compute l_{τ_{t+1}}(G; C_{t+1}) [Eq. 10]
    if l_{τ_{t+1}}(G; C_{t+1}) > l_{τ_t}(G; C_t) then
        Return {C_{t+1}, τ_{t+1}}
    else
        Return {C_t, τ_t}

6. FEATURE EXTRACTION 

Profile information of each author node in the ego network is
represented as a feature vector. The features can be divided into two
broad categories: general and ego-centric features. With these two
separate categories, the feature set emphasizes the fact that members
of a common circle should not only have high feature similarity with
each other but also share similar relationships with the ego.

Given an author $x$ with all her publications, and the set of fields
of research $F = \{r_1, r_2, \ldots, r_{24}\}$¹, we define the
versatility vector $V(x)$ of $x$ as $\{r_{i,x} ; r_i \in F\}$ such
that $r_{i,x}$ is the fraction of publications of $x$ in field $r_i$.
Also, given a set of decades $DEC$ = {1960-1970, 1971-1980,
1981-1990, 1991-2000, 2001-2009}, we define the persistence vector
$D(x)$ for $x$ as $\{d_{j,x} ; 1 \leq j \leq 5\}$, where $d_{j,x}$
denotes the number of papers published by $x$ in decade $DEC(j)$. We
also define the major field of work $R(x)$ for $x$ as the field in
which she has the maximum number of publications.
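
These three quantities are simple to compute from a publication list. The sketch below assumes each publication is a hypothetical `(year, field)` pair and uses three made-up field labels instead of the 24 fields of the dataset; ties for the major field are broken by first occurrence, which is a choice of this sketch.

```python
from collections import Counter

DECADES = [(1960, 1970), (1971, 1980), (1981, 1990), (1991, 2000), (2001, 2009)]

def versatility(pubs, fields):
    """Fraction of the author's papers in each field, in the order given by `fields`."""
    counts = Counter(field for _, field in pubs)
    n = len(pubs)
    return [counts[f] / n if n else 0.0 for f in fields]

def persistence(pubs):
    """Number of the author's papers falling in each decade of DECADES."""
    return [sum(lo <= year <= hi for year, _ in pubs) for lo, hi in DECADES]

def major_field(pubs):
    """Field with the most papers (expects at least one publication)."""
    return Counter(field for _, field in pubs).most_common(1)[0][0]

# Toy author with four papers.
pubs = [(1995, "AI"), (2003, "AI"), (2005, "DB"), (2008, "AI")]
v = versatility(pubs, ["AI", "DB", "ML"])
d = persistence(pubs)
r = major_field(pubs)
```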

The general features capture independent characteristics 
of each author in the ego network and are listed below: 

• The normalized number of citations the author has re¬ 
ceived (size 1) 

• The normalized number of citations per paper that the 
author has received (size 1) 

• The normalized h-index of the author (size 1) 

• The normalized number of coauthors of the author (size 1)

• The versatility vector of the author (size 24)

• The normalized number of papers written by the author
(size 1)

• The persistence vector of the author (size 5)

• The major field of the author (size 1)

¹ Note that there are 24 research fields present in our dataset.

On the other hand, the ego-centric features capture the 
relationship of an alter with its ego. Such features include: 

• The fraction of papers coauthored by the alter with the 
ego in each of the five decades (size 5) 

• The fraction of papers coauthored by the alter with the 
ego in each of the 24 fields (size 24) 

• The normalized number of common coauthors that the alter has with
the ego (size 1)

• The fraction of papers authored by the alter in the major 
field of the ego (size 1) 

• The fraction of papers authored by the ego in the major 
field of the alter (size 1) 

Thus the dimension of the feature vector containing all the above
listed features is 67. Using the profile information for each node,
our model computes the probability of edge existence between each
pair of nodes $(x, y)$, given by $p(x, y)$ as described in Equation
6. We calculate this probability from the extent of similarity of the
node pair $(x, y)$, i.e., $Sim(x, y)$. In order to verify that the
node similarity indeed helps identify edges between similar authors
with a high probability of collaboration, we perform two small
experiments. We first check the conditional probability that, given a
node pair $(x, y)$ with similarity $Sim(x, y) = w_{xy}$ in $G(V, E)$,
the node pair indeed materializes as an edge in the real network.



Figure 2: (a) Conditional probability of edge existence between
authors with a given similarity $w_{xy}$ between their profiles; (b)
actual number of edges with a given edge weight in the network.

The plot in Figure 2(a) confirms that our similarity measure is
indeed proportional to the conditional probability of edge existence.
We also observe the number of edges in the network having a
particular edge weight $w_{xy}$ in Figure 2(b). Most of the edges are
in the range 0.55-0.75, indicating that this is the most common
profile similarity range among pairs of authors. Very few edges exist
in the range 0-0.3, which indicates that collaboration between
authors with very dissimilar profiles is quite rare. The number of
edges is also quite low in the range 0.9-1.0, which might be because
it is extremely rare to have authors with nearly identical profiles.
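
The first of these checks can be sketched as a simple binning procedure. The sketch assumes similarities already scaled to [0, 1] (as the ranges quoted above suggest) and a hypothetical input of `(similarity, is_edge)` records for node pairs; the bin count is arbitrary.

```python
def edge_probability_by_similarity(pairs, n_bins=10):
    """Per-bin fraction of node pairs that are edges; None for empty bins."""
    totals = [0] * n_bins
    edges = [0] * n_bins
    for w, is_edge in pairs:
        b = min(int(w * n_bins), n_bins - 1)  # similarity 1.0 falls in the last bin
        totals[b] += 1
        edges[b] += bool(is_edge)
    return [e / t if t else None for e, t in zip(edges, totals)]

# Toy records: (similarity w_xy, whether the pair is an edge in the real network).
pairs = [(0.05, False), (0.07, False), (0.65, True), (0.62, False), (1.0, True)]
fractions = edge_probability_by_similarity(pairs)
```

Plotting the per-bin fractions against the bin centers gives an empirical curve of the kind described for Figure 2(a), while the per-bin edge counts correspond to Figure 2(b).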

7. EVALUATION OF DETECTED CIRCLES 

In this section, we evaluate the quality of the circles detected by
our proposed methodology. We compare the circles detected by our
model with those obtained from four other recent overlapping
community detection algorithms, namely BIGCLAM [22], SLPA [20], OSLOM
[13] and COPRA [10]. We also detect the circles using the coordinate
ascent method (CA) [15, 16]. Since we intend to show that the
research field of the authors is not the proper information for
creating the circles, we also compare our output with the circles
obtained simply from research fields. For comparison, we use
overlapping modularity $Q_{ov}$ [?], which is probably the most
widely used measure for evaluating the goodness of a community
structure without a ground truth.

Method               Q_ov
BIGCLAM              0.60
SLPA                 0.56
OSLOM                0.59
COPRA                0.58
Field based          0.45
Coordinate ascent    0.64
Our method           0.68

Figure 3: (Left) Change in overlapping modularity $Q_{ov}$ with the
increase in $\tau_L$; (Right) comparison of the baseline algorithms
with our method.

Figure 4: (Color online) Author-specific characterization of detected
ego-centric circles. The following are plotted against the number of
citations per author: (a) the number of circles centered around an
author, (b) the average size of the circles centered around an
author, (c) the number of ego-centric circles an author is a part of
and (d) the average cliquishness of the circles centered around an
author.

First, to show the change in $Q_{ov}$ with respect to the threshold
$\tau_L$ described in Section 4, we plot this quality function in
Figure 3 by varying $\tau_L$ from 0.05 to 0.5. We observe that
$Q_{ov}$ reaches its maximum at $\tau_L = 0.2$. Then, for each
competing algorithm, we measure the value of $Q_{ov}$ for each ego
and take the average over all the egos present in our dataset. The
table adjacent to Figure 3 shows that our method outperforms the
traditional topology-based community finding algorithms in detecting
meaningful circles. Our method achieves a $Q_{ov}$ of 0.68, which is
6.25% higher than the coordinate ascent method, 13.33% higher than
BIGCLAM, 15.25% higher than OSLOM, 17.24% higher than COPRA, and
21.42% higher than SLPA. We notice that research-field-based circles
are the worst among the detected circles (see Section 9 for more
discussion).
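
The quoted improvements are relative gains over each baseline, which can be re-derived from the tabulated $Q_{ov}$ values. The short sketch below does this; the variable and function names are hypothetical, and the inputs are the two-decimal values from the table, so small deviations from figures computed on unrounded values are possible.

```python
Q_OV = {
    "BIGCLAM": 0.60,
    "SLPA": 0.56,
    "OSLOM": 0.59,
    "COPRA": 0.58,
    "Field based": 0.45,
    "Coordinate ascent": 0.64,
}
OURS = 0.68

def relative_gain(ours, baseline):
    """Percentage by which `ours` exceeds `baseline`."""
    return 100.0 * (ours - baseline) / baseline

gains = {name: round(relative_gain(OURS, q), 2) for name, q in Q_OV.items()}
```

With these inputs the gains over coordinate ascent, BIGCLAM, OSLOM and COPRA match the figures in the text, while the gain over SLPA comes out at 21.43 when rounded to two decimals; the 21.42% in the text is consistent with truncation or with unrounded $Q_{ov}$ values.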

8. ANALYSIS OF EGO-CENTRIC CIRCLES 

In this section, we characterize the ego-centric circles obtained
from our unsupervised model. In particular, we study the properties
of the ego-centric circles at two levels of granularity:
author-specific analysis and circle-specific analysis.

8.1 Author-specific Analysis 

Here we study how the circles in the ego networks of highly-cited
authors differ from those of low-cited authors. Figure 4(a) shows the
number of ego-centric circles appearing in the ego network of each
author. Note that in all of the experiments, we categorize authors
into three groups: authors receiving more than 100 citations as
highly-cited authors (proportion: 5.21%), authors receiving between
30 and 100 citations as medium-cited authors (proportion: 28.75%),
and authors receiving fewer than 30 citations as low-cited authors
(proportion: 66.04%). We notice a rise in the number of circles with
the increase in citations. A possible reason is that, since authors
accumulating many citations tend to have a large number of
collaborators, the number of alters in their ego networks is also
high, and thus more ego-centric circles are detected for highly-cited
authors.

In Figure 4(b), we plot the average size (measured in terms of the
number of nodes) of the ego-centric circles for authors in different
citation ranges. Once again, the average size of the circles
increases with the number of citations per author. This indicates
that for highly-cited egos, the alters are not only high in number
but also form large cohesive groups.

Since each author is also an alter in her neighbors’ ego networks,
she might be a part of multiple such ego-centric circles. Figure 4(c)
shows the number of such ego-centric circles to which an author
belongs. This plot correlates highly with Figure 4(a) and shows that,
since highly-cited authors have more alters in their ego networks,
each of them also belongs to multiple local circles of her neighbors’
ego networks.

Further, we measure the degree of cliquishness (the ratio of
the number of existing edges in the circle to the maximum
number of possible edges in the circle) of each ego-centric
circle. For each ego, we measure the average cliquishness of
her surrounding circles in the ego network. Figure 4(d) shows
that the average value of cliquishness initially decreases with
the increase of the number of citations per author, and then
starts increasing. A possible reason is that since both the
number and the size of the ego-centric circles for low-cited
authors are small, the maximum number of possible edges
within a circle is also small, which in turn leads to high
cliquishness for low-cited authors. In the middle citation
zone, both the number and the average size of the circles
are moderate. However, the number of edges that materialize
within these circles is small compared to the maximum
number of possible edges, which accounts for the sparseness
of these circles. Therefore, the cliquishness of circles
spawned by authors in the middle range of citations is
comparatively low. However, the cliquishness starts
increasing for the authors having more than 100 citations.
















Figure 5: (Color online) Circle-specific characteristics of detected ego-centric circles: (a) distribution of 
the size of the ego-centric circles, (b) distribution of the cliquishness of the ego-centric circles, (c) average 
cliquishness of the circles having a particular size, and (d) percentage of egos surrounded by a particular 
number of ego-centric circles. 


This signifies that for the highly-cited authors, despite the
apparently large size of circles, the probability that an edge
actually materializes in the real network tends to increase.
This explains the formation of dense ego-centric cliques
surrounding the ego in the high-citation range.
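
The cliquishness measure used above (existing edges divided by the maximum number of possible edges in a circle) can be written down as a short sketch. The function names are hypothetical, the graph is assumed to be undirected and given as a list of node pairs, and whether the ego itself is counted as a member of the circle is left to the caller.

```python
from itertools import combinations

def cliquishness(circle, edges):
    """Ratio of existing edges to possible edges among the circle's nodes."""
    members = set(circle)
    n = len(members)
    if n < 2:
        # Assumption: a circle with fewer than two nodes has no possible edge.
        return 0.0
    edge_set = {frozenset(e) for e in edges}
    existing = sum(
        1 for pair in combinations(members, 2) if frozenset(pair) in edge_set
    )
    return existing / (n * (n - 1) / 2)

def average_cliquishness(circles, edges):
    """Average cliquishness over all circles surrounding one ego."""
    circles = list(circles)
    if not circles:
        return 0.0
    return sum(cliquishness(c, edges) for c in circles) / len(circles)

# Toy co-authorship edges (hypothetical).
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(cliquishness({"a", "b", "c"}, edges))
print(average_cliquishness([{"a", "b"}, {"a", "b", "c"}], edges))
```

Because the denominator grows quadratically with the circle size, small circles reach high values with very few edges, which is consistent with the explanation given above for low-cited authors.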

8.2 Circle-specific Analysis 

Now we look into some of the characteristic features
specific to an ego-centric circle. In Figure 5(a), we plot the
percentage of ego-centric circles having a particular size
$S_c$. It follows a Gaussian distribution at the beginning
along with a heavy tail at the end. We observe that around
65.26% of the circles have sizes between 4 and 30. However,
the flat tail at the end shows that more than 15% of the
circles have a size greater than 50. Figure 5(b) shows the
distribution of the cliquishness ($C_c$) of the ego-centric
circles. Surprisingly, it again follows a Gaussian distribution
with mean ≈ 0.44 and variance ≈ 0.02. We notice that around
59.28% of the circles have cliquishness values in the range
[0.4, 0.6], which is quite high. Further inspection reveals
that low-degree egos are surrounded by small-size circles and
therefore their cliquishness value is quite high. To get a
clear idea of the relation between the size and the
cliquishness of the ego-centric circles, we plot in Figure 5(c)
the average cliquishness of the circles having a specific
size. The value of cliquishness $C_c$ gradually decreases
with the increase of the size $S_c$ till $S_c = 40$, which
is followed by a sharp increase. As mentioned earlier, the
increase of cliquishness at the end once again emphasizes
that the large-size circles centered around the highly-cited
authors are relatively dense. Finally, we plot the percentage
of egos surrounded by a specific number of circles in Figure
5(d). As expected, we observe that the plot has a declining
trend from the very beginning, which once again highlights
our previous observation that most of the low-cited authors
have a low degree in the co-authorship network, and spawn
only a few ego-centric circles. Since the co-authorship
network is mostly dominated by low-degree authors, most of
the egos are fringed by a small number of local circles.
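
The circle-level summaries discussed in this subsection (percentage of circles per size, and average cliquishness per size) amount to simple group-by computations. The sketch below is illustrative only; the function names and the toy values are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict

def size_distribution(circles):
    """circles: list of (size, cliquishness). Returns {size: % of circles}."""
    sizes = Counter(size for size, _ in circles)
    total = sum(sizes.values())
    return {s: 100.0 * c / total for s, c in sorted(sizes.items())}

def mean_cliquishness_by_size(circles):
    """Average cliquishness of the circles having each particular size."""
    buckets = defaultdict(list)
    for size, cliq in circles:
        buckets[size].append(cliq)
    return {s: sum(v) / len(v) for s, v in sorted(buckets.items())}

# Hypothetical toy circles: (size, cliquishness).
toy_circles = [(4, 0.8), (4, 0.6), (10, 0.4), (50, 0.5)]
print(size_distribution(toy_circles))
print(mean_cliquishness_by_size(toy_circles))
```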

9. INTERPRETATION OF EGO-CENTRIC CIRCLES

[Figure 6 panels: (a) x-axis: authors, arranged in decreasing
order of the cliquishness of circles detected by our model;
(b) x-axis: different citation ranges of authors (Range 1 to
Range 4).]

Figure 6: (Color online) (a) Comparison of cliquishness
of area-based circles and the circles detected by our
model, (b) average homogeneity of the ego-centric circles
detected by our model for the authors categorized into
four zones as per number of citations: Range 1 (>200),
Range 2 (>100 & <=200), Range 3 (>30 & <=100) and
Range 4 (<=30).

In a co-authorship network, the most intuitive ground-truth
communities are often assumed to be different areas of
research [7] in a particular domain. Therefore, one can
interpret each ego-centric circle as a group of coauthors
working in a specific research area. Since we know the major
research area of each author in the dataset, for each ego we
further group its coauthors based only on their major research
area, such that each circle corresponds to an area and consists
of coauthors working in this area. Then we measure the average
cliquishness of these research-field-based circles for each
author vis-a-vis that of the ego-centric circles detected by
our model. Essentially, we intend to cross-validate our
hypothesis that considering a single-dimensional feature
vector of an author, such as the field information, is not an
appropriate way of encircling alters; rather, each circle
might represent an individual dimension of cohesiveness among
its constituent nodes, as shown in Figure 1. In Figure 6(a),
we plot the average cliquishness of field-based circles
vis-a-vis that of the circles detected by the model
surrounding each author. As expected, the cliquishness of the
detected circles is significantly higher than that of the
field-based circles. Therefore, we conclude that the
field-based circles might not appropriately group highly
cohesive nodes; rather, the circles detected by our model seem
to be more representative and meaningful.
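
The comparison between area-based circles and detected circles can be illustrated with a small sketch. Everything here is hypothetical (function names, the toy ego network, the field labels); it only shows how coauthors would be grouped by major field and how the average edge density of the two groupings could be compared.

```python
from collections import defaultdict
from itertools import combinations

def area_based_circles(alters, major_field):
    """Group the coauthors of one ego by their major research area."""
    circles = defaultdict(set)
    for author in alters:
        circles[major_field[author]].add(author)
    return list(circles.values())

def density(circle, edge_set):
    """Existing edges divided by possible edges within one circle."""
    n = len(circle)
    if n < 2:
        return 0.0
    existing = sum(
        1 for pair in combinations(circle, 2) if frozenset(pair) in edge_set
    )
    return existing / (n * (n - 1) / 2)

def mean_density(circles, edges):
    """Average density over a list of circles."""
    edge_set = {frozenset(e) for e in edges}
    return sum(density(c, edge_set) for c in circles) / len(circles)

# Hypothetical ego network: alters, their major fields, and edges among them.
alters = ["a", "b", "c", "d"]
major_field = {"a": "DM", "b": "DM", "c": "ML", "d": "ML"}
edges = [("a", "c"), ("b", "d")]

field_circles = area_based_circles(alters, major_field)
detected_circles = [{"a", "c"}, {"b", "d"}]  # assumed output of a circle detector
print(mean_density(field_circles, edges), mean_density(detected_circles, edges))
```

In this toy case the field-based grouping puts together authors who never collaborated with each other, so its average density is lower than that of the assumed detected circles.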

[Figure 7 x-axis: 24 fields in computer science.]

Figure 7: (Color online) Heat maps representing the
fraction of ego-centric circles marked by each of the 24
fields for highly-cited, medium-cited and low-cited authors.
For each author, the elements in the horizontal axis are
sorted in descending order.

We further mark each of the detected circles by the field
that is the major research area for most of its constituent
coauthors. Then, for each ego, we measure the fraction of
circles belonging to each of the 24 fields. Therefore, each
ego/author can now be represented by a vector of size 24
whose $i$-th entry represents the fraction of ego-centric
circles marked by field $i$. Figure 7 shows three heat maps
corresponding to highly-cited, medium-cited and low-cited
authors. For the sake of brevity, we only plot values for
1000 authors from each citation range, although the results
are similar for other authors. We observe that for
highly-cited authors, ego-centric circles are mostly marked
by a few fields, which indicates that the highly-cited authors
tend to collaborate with people from similar research areas.
If this is true, then the immediate question would be why
the coauthors having the same research interest are encircled
into different groups by our model. Further inspection
reveals that along with the field, each group also represents
the time of collaboration. For instance, the ego network of
Author 10 (one of the highly-cited authors) is shown in
Figure 8(a). One group of her ego network encircles authors
in Data Mining who coauthored with her during 1997-2000.
Another such group comprises authors from Machine Learning
who collaborated with her during 2000-2003. Therefore, the
field of research and the time of collaboration act as two
major dimensions in this case.
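
The per-ego field vector described above (each circle is marked by the majority field of its members, and the ego is represented by the fraction of circles marked by each field) can be sketched as follows. The names are hypothetical, only three fields are used instead of 24 for brevity, and ties between fields are broken arbitrarily in this sketch.

```python
from collections import Counter

def circle_label(circle, major_field):
    """Field that is the major research area of most members of the circle."""
    counts = Counter(major_field[author] for author in circle)
    return counts.most_common(1)[0][0]  # ties are broken arbitrarily

def field_fraction_vector(circles, major_field, fields):
    """Fraction of the ego's circles marked by each field, in `fields` order."""
    labels = Counter(circle_label(c, major_field) for c in circles)
    total = len(circles)
    if total == 0:
        return [0.0 for _ in fields]
    return [labels[f] / total for f in fields]

# Hypothetical example with three fields.
fields = ["DM", "ML", "DB"]
major_field = {"a": "DM", "b": "DM", "c": "ML", "d": "ML", "e": "ML", "f": "DB"}
circles = [["a", "b", "c"], ["c", "d", "e"], ["a", "b", "f"], ["d", "e", "f"]]
print(field_fraction_vector(circles, major_field, fields))
```

Sorting each such vector in descending order, as the caption of Figure 7 describes, gives one row of the heat map.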

Next, for the medium-cited authors, the heat map in
Figure 7(b) shows that the distribution of circles into
different fields is much more uniform as compared to Figure
7(a). The example shown in Figure 8(b) also corroborates
the hypothesis that with the decrease of citations, the
ego-centric circles become harder to interpret distinctly.
From Figure 8(b), we notice that the time durations of
collaboration corresponding to the circles overlap and,
therefore, it is very hard to distinguish these circles. The
result is even more cluttered for low-cited authors, as shown
in Figures 7(c) and 8(c). These results thus lead to a
general conclusion that the highly-cited authors seem to
coauthor with a group of people having a specific research
interest in a particular time period and then move to another
such group of coauthors, whereas this tendency is not so
prominent for medium- and low-cited authors. However, we
posit that there might be other dimensions (such as the name
of the institute to which the author belongs) that might help
us interpret these circles more clearly.


Homogeneity of ego-centric circles. We define a
field-based homogeneity for ego-centric circles to verify if,
in most cases, authors from the same field tend to form
communities and whether the circles spawned by our
unsupervised approach are able to capture this tendency.
Given a circle $C$, we define $F_{C,i}$ to be the fraction of
authors in $C$ with major field $f_i$. One can easily infer
that a uniform distribution $F_{C,i}$ implies that the circle
is heterogeneous with respect to the field of work, while a
skewed distribution (with the majority of authors in one or
two fields) characterizes a more field-specific circle. In
particular, we define the homogeneity coefficient $H(C)$ for
circle $C$ in terms of the entropy of the circle with respect
to the distribution across different fields as in Equation 16.
The greater the entropy, the lower the homogeneity, and vice
versa.

$$H(C) = 1 - \left( -\sum_{i=1}^{24} F_{C,i} \log(F_{C,i}) \right) \qquad (16)$$

Figure 6(b) captures the average homogeneity of circles in
the ego networks of authors in four different citation ranges.
We note that the homogeneity is highest for the authors
in the very high citation range (> 200) and has low variance,
indicating that highly-cited authors tend to spawn circles
that have alters in very similar fields, whereas authors with
medium citations (30-100) spawn more diverse circles. The
authors with low citations (< 30) exhibit a higher degree of
homogeneity than those in the medium range, but this may
be attributed to the fact that they spawn very small-sized
circles.

[Footnote] The names of the authors are anonymized in order
to maintain privacy.


10. TASK BASED EVALUATION

We further evaluate the quality of the circles through a
task-based evaluation framework: the task of collaboration
prediction. We choose two supervised learning models: linear
regression (LR) [4] and supervised random walks (SRW) [3].
We then demonstrate that including the ego-centric circles
detected by our model as a feature in the feature set enhances
the performance of these models with respect to the case in
which the circle information is missing.

10.1 Problem Definition

For our problem, we assume a temporal graph $G_t =
(V_t, E_t)$ where $V_t$ represents a set of nodes such that
each node $u_t \in V_t$ is associated with a time stamp $t$
indicating its first appearance in $G_t$, and each edge
$e = (u_{t_i}, v_{t_j}) \in E_t$ connects two nodes $u_{t_i}$
and $v_{t_j}$ (such that $u_{t_i}, v_{t_j} \in V_t$ and
$t_i \le t_j$). Each node $u_t$ is also associated with a
feature vector $f_{u_t}$ at time stamp $t$, whose entries
might change over time. Now, given a longitudinal snapshot of
the graph $G_t$ from the beginning till time $T$, say
$G_T = (V_T, E_T)$, the collaboration prediction problem aims
at predicting the collaborations which are going to appear
among the vertices in $V_T$ within a time period $\Delta t$
after $T$.

This task is very challenging due to the extreme sparsity of
real networks, where each node is connected to only a very
small fraction of all other nodes in the network (i.e., the
dataset contains a high proportion of negative evidence).
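
The collaboration prediction setting (observe the network up to a time T, then predict new collaborations among already-present vertices within a window after T) can be sketched as below. This is a simplified illustration, not the authors' pipeline: the function name and the toy edge list are hypothetical, and the vertex set at time T is approximated by the endpoints of the edges observed up to T.

```python
def split_by_time(timed_edges, T, delta_t):
    """timed_edges: iterable of (u, v, year) co-authorship records.

    Returns (nodes, train_edges, future_edges) where train_edges are the
    collaborations observed up to T and future_edges are the new
    collaborations among already-known nodes in (T, T + delta_t].
    """
    timed_edges = list(timed_edges)
    train_edges = {frozenset((u, v)) for u, v, year in timed_edges
                   if year <= T and u != v}
    # Assumption: a node belongs to the snapshot if it has an edge up to T.
    nodes = set()
    for pair in train_edges:
        nodes |= pair
    future_edges = set()
    for u, v, year in timed_edges:
        pair = frozenset((u, v))
        if (T < year <= T + delta_t and u != v
                and u in nodes and v in nodes
                and pair not in train_edges):
            future_edges.add(pair)
    return nodes, train_edges, future_edges

# Hypothetical records: (author, author, year of joint paper).
records = [("a", "b", 1994), ("b", "c", 1995), ("a", "c", 1997),
           ("a", "d", 1998), ("a", "b", 1996)]
nodes, train, future = split_by_time(records, T=1995, delta_t=4)
print(sorted(nodes), [sorted(p) for p in future])
```

In the toy example only the pair (a, c) is a positive example: (a, b) already exists before T, and d is not part of the snapshot. All remaining node pairs are negatives, which illustrates why the positive class is so sparse.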


Figure 8: (Color online) Examples of ego-networks from three
citation zones. Individual nodes have different colors
corresponding to different areas of research. If the color of
all the nodes in a circle is the same as the color of the
circle, the value of homogeneity is 1. The time period
($t_i$ - $t_j$) associated with each circle indicates that
the ego wrote a paper with any one of its constituent
coauthors for the first (last) time in year $t_i$ ($t_j$).

10.2 Feature Set

We use a set of node- and edge-level features for the
learning models. The following set of node-level features
(denoted by N) are used. Each feature is normalized by the
maximum value of the corresponding feature so that the values
range between 0 and 1.

• Normalized number of citations received by an author
• Normalized h-index of an author
• Normalized number of coauthors of an author
• Fraction of papers by an author in each of the 24 fields
• Normalized number of papers written by an author
• Fraction of papers published by an author in each of the
five decades (between 1960-2009)

Further, given an edge e = (x, y) in the co-authorship
network, we additionally use the following edge-level
features (denoted by E). Each feature is appropriately
normalized to a value between 0 and 1.

• Fraction of papers coauthored by x and y in each of the
five decades
• Normalized number of common coauthors of x and y
• Fraction of papers authored by x in the major field of y
• Fraction of papers authored by y in the major field of x

We refer to the combined set of both node- and edge-level
features by NE. We provide this set NE of node and edge
attributes as an input to the learning model, which then
takes care of determining how to combine them with the
network structure to make predictions [3]. Note that if we
take the dataset till t for training the model, all the
features mentioned above are calculated based on the
statistics of each vertex till t in order to avoid
information leakage.
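
A few of the features listed in this section can be illustrated with a minimal sketch. The helper names and toy values are hypothetical; the sketch only shows max-normalization of a node-level count, the number of common coauthors of a pair, and the fraction of an author's papers in a given field.

```python
def max_normalize(raw):
    """raw: {author: value}. Divide by the maximum so values lie in [0, 1]."""
    peak = max(raw.values(), default=0)
    if peak <= 0:
        return {a: 0.0 for a in raw}
    return {a: v / peak for a, v in raw.items()}

def common_coauthors(x, y, coauthors):
    """Number of authors who have collaborated with both x and y."""
    return len(coauthors.get(x, set()) & coauthors.get(y, set()))

def fraction_in_field(paper_fields, field):
    """paper_fields: list holding the field of each paper of one author."""
    if not paper_fields:
        return 0.0
    return paper_fields.count(field) / len(paper_fields)

# Hypothetical statistics computed only from data up to the training year.
citations = {"x": 50, "y": 200, "z": 0}
coauthors = {"x": {"a", "b", "y"}, "y": {"a", "c", "x"}}
papers_of_x = ["DM", "DM", "ML", "DB"]

print(max_normalize(citations))
print(common_coauthors("x", "y", coauthors))
print(fraction_in_field(papers_of_x, "DM"))  # share of x's papers in y's major field
```

The raw count of common coauthors would itself be passed through a normalization such as `max_normalize` before being used as an edge-level feature.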

10.3 Evaluation Methodology 

In order to demonstrate that the predictions are robust
irrespective of the time stamp used for dividing the dataset
into training and test sets, we run the competing models in
three different time periods: (i) the dataset till 1995 is
used for training and the accuracies of the models are
measured on the new edges formed between 1996-1999,
(ii) similarly, the dataset till 2000 is used for training
and 2001-2004 for checking the accuracy, and (iii) the
dataset till 2005 is used for training and 2006-2009 for
checking the accuracy.

For each time period, we evaluate the methods on the test
set, considering two performance metrics: the Area under
the ROC curve (AUC) and the Precision at Top 20 (Prec@20),
i.e., for each node s, what fraction of the top 20 nodes
suggested by each model actually receive links from s later.
This measure is particularly appropriate in the context of
link recommendation, where we present a user with a set of
suggested coauthors and aim for most of them to be correct.
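
The two metrics can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the evaluation code used for the experiments: Prec@k divides by k even when fewer than k candidates are available, and AUC is computed through its pairwise-ranking interpretation (the probability that a positive pair is scored above a negative pair, counting ties as one half).

```python
def precision_at_k(ranked_candidates, true_links, k=20):
    """Fraction of the top-k suggested nodes that actually receive a link."""
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    return sum(1 for c in top if c in true_links) / k

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for four candidate pairs and their true labels.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
# Hypothetical ranking for one node s, with k = 4 for brevity.
print(precision_at_k(["b", "c", "d", "e"], {"b", "e", "z"}, k=4))
```

The quadratic pairwise loop is fine for a small illustration; on a large test set a rank-based computation would be preferable.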

10.4 Performance Evaluation 

We compare the predictive performance of the two learning
models, including the circle information, in the three
different time periods mentioned in Section 10.3. We run each
of these collaboration prediction models using different sets
of features: (i) only node-level features (Model: N), (ii) only
edge-level features (Model: E), (iii) both node- and
edge-level features (Model: NE), (iv) besides node- and
edge-level features, a binary feature B that checks whether a
pair of nodes (x, y) belongs to at least one common
ego-centric circle or not (Model: NEB), and (v) besides
node-level and edge-level features and the binary circle
information, a numeric feature C indicating the number of
common circles a pair of nodes (x, y) is a part of (Model:
NEBC). The circles are detected by our model, the coordinate
ascent method (CA) [15, 16] and BIGCLAM separately.
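
The two circle-based features can be computed directly from the detected circles. The sketch below is illustrative: the function name is hypothetical, and `circles` is assumed to be the collection of all circles produced by whichever detection method is being evaluated, each given as a set of authors.

```python
def circle_features(x, y, circles):
    """Return (B, C) for the node pair (x, y).

    B is 1 if x and y share at least one circle, else 0.
    C is the number of circles that contain both x and y.
    """
    common = sum(1 for circle in circles if x in circle and y in circle)
    return (1 if common > 0 else 0), common

# Hypothetical detected circles.
circles = [{"x", "y", "z"}, {"x", "y"}, {"y", "w"}]
print(circle_features("x", "y", circles))
print(circle_features("x", "w", circles))
```

B is appended to the NE feature set to obtain NEB, and C is appended to NEB to obtain NEBC.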

Table 1 shows the performance of these two prediction
models with different feature sets. We notice that edge
features are more effective than node features, and the
performance improves incrementally after combining different
features together. A general observation is that inclusion
of circle information in the feature set improves the
performance of both the prediction models irrespective of the
time periods. For instance, it improves the performance by
9.87% and 15.25% on average in terms of AUC and Prec@20
respectively compared to the case where the circle
information is not present (NE).

We further observe that including the circle information
detected by our model significantly outperforms the cases
where the circles are obtained by BIGCLAM and CA in each
time period. Including the binary circle information (NEB)
from our model achieves an average AUC improvement of
2.16% and 3.51% for the LR and SRW models respectively
(similarly, in terms of Prec@20, the improvement is 3.75% and
2.94% for the LR and SRW models respectively) compared to
BIGCLAM.

Further, including the count of common circles for a node
pair (NEBC) in the feature set leads both LR and SRW
to achieve even better performance. We observe an average
AUC improvement of 3.41% (1.11%) and 3.31% (0.57%)
for the LR and SRW models respectively using our circle
information as compared to that obtained from BIGCLAM (CA)
(similarly, in terms of Prec@20, the improvement is 6.35%
(5.14%) and 6.16% (3.22%) for the LR and SRW models
respectively).
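
The relative improvements quoted in this section can be approximately recomputed from the "Average" rows of Table 1. The sketch below uses only the average AUC values for NE and NEBC; the dictionary layout and the function name are hypothetical, and small differences from the reported percentages are to be expected because of rounding in the table.

```python
def pct_improvement(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

# Average AUC values from Table 1 (NE, and NEBC with BIG / CA / CIRC circles).
avg_auc = {
    "LR":  {"NE": 0.6472, "BIG": 0.6913, "CA": 0.7069, "CIRC": 0.7148},
    "SRW": {"NE": 0.7620, "BIG": 0.8060, "CA": 0.8279, "CIRC": 0.8327},
}

for model, row in avg_auc.items():
    print(model,
          round(pct_improvement(row["CIRC"], row["BIG"]), 2),
          round(pct_improvement(row["CIRC"], row["CA"]), 2))

# Gain of NEBC (our circles) over NE, averaged over the two models.
overall = sum(pct_improvement(r["CIRC"], r["NE"]) for r in avg_auc.values()) / 2
print(round(overall, 2))
```

The last value comes out close to the 9.87% average AUC improvement over NE stated above.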

11. CONCLUSIONS AND FUTURE WORK 

Circles allow us to organize the overwhelming volumes of
data generated by an author's personal academic network.
In this work, we proposed a simple yet effective method
of detecting ego-centric circles in co-authorship networks.
However, the proposed method is applicable to any general
ego network given a suitable set of features. Our model is
unsupervised and combines node attributes and node
similarities to identify circles that resemble communities in
real networks. Experiments with four state-of-the-art
overlapping community detection algorithms showed that our
method outperformed these baseline algorithms. Further, a
task-based evaluation achieved superior performance after
inclusion of the circle information detected by our model.

In future, we would like to develop a semi-supervised
version of our algorithm that makes use of manually labeled
data. Although most authors may not want to label the
circles manually, it would be highly desirable to make use
of the information from those few who do. Additionally, we
would also like to apply the proposed method to other
datasets.

Table 1: Comparison of BIGCLAM (BIG), the coordinate ascent
method (CA) [15, 16] and our model (CIRC) after including
their detected circle information into the feature set of the
Linear Regression (LR) and Supervised Random Walks (SRW)
frameworks across three time periods and different feature
sets (N: node-level, E: edge-level, NE: node- and edge-level,
NEB: adding the binary circle information to NE, NEBC: adding
the numerical circle information to NEB). A column such as
NEB-BIG denotes the NEB feature set with circles from BIG.

Area Under the ROC Curve (AUC), Linear Regression (LR)

Time period  N          E          NE         NEB-BIG    NEB-CA     NEB-CIRC   NEBC-BIG   NEBC-CA    NEBC-CIRC
1996-1999    0.5872     0.5914     0.6451     0.6569     0.6689     0.6791     0.6989     0.7195     0.7235
2001-2004    0.5890     0.5907     0.6528     0.6529     0.6437     0.6659     0.6845     0.7011     0.7012
2006-2009    0.5916     0.5891     0.6436     0.6439     0.6510     0.6509     0.6905     0.7001     0.7198
Average      0.5893     0.5904     0.6472     0.6512     0.6545     0.6653     0.6913     0.7069     0.7148

Area Under the ROC Curve (AUC), Supervised Random Walks (SRW)

Time period  N          E          NE         NEB-BIG    NEB-CA     NEB-CIRC   NEBC-BIG   NEBC-CA    NEBC-CIRC
1996-1999    0.6332     0.6478     0.7659     0.7908     0.7895     0.8275     0.7971     0.8296     0.8303
2001-2004    0.6419     0.6514     0.7591     0.8067     0.8035     0.8249     0.8098     0.8149     0.8356
2006-2009    0.6360     0.6608     0.7609     0.8001     0.8101     0.8295     0.8111     0.8279     0.8321
Average      0.6370     0.6533     0.7620     0.7992     0.8101     0.8273     0.8060     0.8279     0.8327

Prec@20, Linear Regression (LR)

Time period  N          E          NE         NEB-BIG    NEB-CA     NEB-CIRC   NEBC-BIG   NEBC-CA    NEBC-CIRC
1996-1999    0.137      0.124      0.152      0.155      0.161      0.158      0.164      0.173      0.177
2001-2004    0.141      0.143      0.156      0.162      0.159      0.169      0.175      0.175      0.185
2006-2009    0.147      0.142      0.161      0.162      0.165      0.171      0.179      0.178      0.189
Average      0.142      0.136      0.156      0.160      0.162      0.166      0.173      0.175      0.184

Prec@20, Supervised Random Walks (SRW)

Time period  N          E          NE         NEB-BIG    NEB-CA     NEB-CIRC   NEBC-BIG   NEBC-CA    NEBC-CIRC
1996-1999    0.165      0.172      0.201      0.205      0.209      0.210      0.207      0.215      0.223
2001-2004    0.158      0.163      0.198      0.200      0.210      0.209      0.215      0.220      0.225
2006-2009    0.161      0.169      0.199      0.208      0.209      0.212      0.211      0.217      0.224
Average      0.161      0.168      0.199      0.204      0.209      0.210      0.211      0.217      0.224
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